Visual Question Answering (VQA) is a recent problem in computer vision and natural language processing that has garnered a large amount of interest from the deep learning, computer vision, and natural language processing communities. In VQA, an algorithm needs to answer text-based questions about images. Since the release of the first VQA dataset in 2014, additional datasets have been released and many algorithms have been proposed. In this review, we critically examine the current state of VQA in terms of problem formulation, existing datasets, evaluation metrics, and algorithms. In particular, we discuss the limitations of current datasets with regard to their ability to properly train and assess VQA algorithms. We then exhaustively review existing algorithms for VQA. Finally, we discuss possible future directions for VQA and image understanding research.